[chore] RFC: Count PartialSuccess in exporterhelper#15280
Conversation
This RFC is to introduce proper accounting for `PartialSuccess` accounting in a way that aligns with the definition of partial success in the OTLP spec.
PartialSuccess in exporterhelperPartialSuccess in exporterhelper
Merging this PR will improve performance by 48.81%
|
| Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|
| ⚡ | BenchmarkMemoryQueueWaitForResult |
74.9 µs | 50.3 µs | +48.81% |
| ⚡ | BenchmarkPersistentQueue |
205.2 µs | 148 µs | +38.67% |
Comparing braydonk:rfc_partial_success (613d4b5) with main (91b32ef)
Footnotes
-
76 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #15280 +/- ##
=======================================
Coverage 91.23% 91.23%
=======================================
Files 703 703
Lines 45902 45902
=======================================
Hits 41878 41878
Misses 2820 2820
Partials 1204 1204 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
|
||
| Two new error APIs will be introduced: | ||
|
|
||
| ### `Countable` error |
There was a problem hiding this comment.
I'm not sure I understand why we need this one... What use cases we want to cover where this one should be used instead of PartialSuccess?
There was a problem hiding this comment.
I suppose it's possible to reduce this just to PartialSuccessError, but since this is also going to be used by the prometheus receiver (see #14440) I thought the PartialSuccess error, which is so tied to exporter use cases, wouldn't make sense. Wanted to make it flexible so they could report any error they might want and simply wrap it in a count.
There was a problem hiding this comment.
I don't see a reason why the Prometheus receiver couldn't use a partialsuccess error, as long as we can provide the failed telemetry count somehow.
|
|
||
| #### Why does it implement status code `OK`? | ||
|
|
||
| A `PartialSuccess` is nominally a success according to the spec. As a result, when `exporterhelper` processes this it needs to know that the Go error it received doesn't actually correspond to a real failure to send data.This will allow accurate setting of the span status (it shouldn't be `Failed` upon `PartialSuccess`) and determining of `error.type` (a new custom type called `Partial_Success` would be introduced). |
There was a problem hiding this comment.
I think we need to clearly define if we want this error to be propagated by the exporter. If we agree that exporter clearly reports the failed portion as dropped in the internal telemetry and returns nil, we don't need to include GRPC status code resolution. However, in order to decide on this, we need to clearly outline how the internal telemetry will be affected in the exporter and upstream components (if we propagate)
There was a problem hiding this comment.
I think the error needs to be propogated by the exporter for sure, I can't think of any nice way around that. This means that the GRPC status code resolution is then necessary for obs_report_sender.go to count the error with an appropriate error.type and recognize not to mark the span as failed.
The rest of the error API was designed such that the error could be worked with as any other permanent error when any other part of exporterhelper other than obs_report_sender.go receive it.
|
|
||
| Two new error APIs will be introduced: | ||
|
|
||
| ### `Countable` error |
There was a problem hiding this comment.
Do we need to have this public surface?
There was a problem hiding this comment.
This is an error that can any other error can be wrapped in along with a count of failures. This will allow for cases where producers of an error want to include an amount of failures.
what is the unit for "failures"? May not be bad to have this as "PartialError" and map 1:1 the PartialSuccess.
1$ question: In case of a connector (like traces to metrics) if we fail and get back that we failed to export 2 metrics. For components prior to the trace to metrics connector that number "2" does not mean "2 spans failed". How do we treat this case?
There was a problem hiding this comment.
Do we need to have this public surface?
Do you mean does the interface need to be public? Not necessarily, I could just make a public function func GetErrorCount(err error) (errCount int, isCountable bool). I thought that was more awkward than the current API of getting the error as xconsumererror.Countable and then calling the Count method on it. But the former makes for less public surface.
There was a problem hiding this comment.
what is the unit for "failures"?
In the error itself I intentionally exclude unit. The unit would be decided by whoever is consuming the error to record failures. In my unfinished otlp exporter draft this ends up working out based on the fact that the exporter itself has a signal type, so when obsReportSender is recording failures for a logging otlp exporter, that failed_sent metric already has a unit of {log}.
There was a problem hiding this comment.
In case of a connector (like traces to metrics)
I hadn't considered connectors, since I was only thinking of exporters for this RFC. However I think most of the design will still fit in, but the big open question of how to record this for more complex scenarios like fanoutconsumer was blocking the ability to at least solve this for exporterhelper and receiverhelper, where the case is more clear-cut.
(I'm not sure I understand how error metrics are supposed to be recorded at all for fanoutconsumer scenarios, let alone being able to report partial counts, but I think if the error recording behaviour is clear for fanoutconsumer the partial part is relatively simple to slot in).
For components prior to the trace to metrics connector that number "2" does not mean "2 spans failed".
Are you talking about the scenario where something failed downstream from the metrics post-conversion? Assuming the simple scenario of 1-to-1 traces to metrics and no fanout, I see two scenarios:
- The trace/span failed to be converted to a metric or was refused by the conversion process in some way. This seems pretty clear that "1" failed item is "1 failed span in conversion" and would be 1 refused span that
obsconsumercould potentially record by recognizingxconsumererror.Countablehere. - The metric failed post-conversion by something down its pipeline. I think it's essentially impossible to determine whether a failed metric in that pipeline somehow lines up with a particular span that was converted to a metric. I think it's fair to say that the
tracestometricsconnector did its job, and considers all spans successful, and if the metric fails after successful conversion then that's for components further down the line to properly report.
I wasn't part of any connector design discussions that happened or may be happening, so the above is based on my own conclusions by reading and not by any definitive explanation given to me.
|
|
||
| A `PartialSuccess` error will wrap an internal error as permanent, implement [GRPC status code resolution](https://pkg.go.dev/google.golang.org/grpc/status#FromError) to code `OK`, and wrap it all as a `Countable` with a count of failures. | ||
|
|
||
| #### Why should the internal error be permanent? |
There was a problem hiding this comment.
@open-telemetry/technical-committee is this a correct read of the protocol definition? I personally think it is.
There was a problem hiding this comment.
Yes.
However, I think the protocol specification can be improved/fixed to allow the service to provide more information regarding the partial success, which would allow the client to retry the items that failed to be delivered if these are retriable.
There was a problem hiding this comment.
Yes, I think this is the correct read. The OTLP spec says:
The client MUST NOT retry the request when it receives a partial success response where the partial_success is populated.
The partial success response signals that some items are successfully received and some others are bad and cannot be received, so no retries are needed for either.
There was a problem hiding this comment.
and some others are bad and cannot be received, so no retries are needed for either.
@tigrannajaryan What could be a reason for being bad? Why couldn't they be received? (We scenarios, where it's not acceptable to have kind of unknown failures)
There was a problem hiding this comment.
What could be a reason for being bad?
Any violation of requirements defined in https://github.com/open-telemetry/opentelemetry-proto/tree/main/opentelemetry/proto could be the reason the data cannot be received.
bogdandrutu
left a comment
There was a problem hiding this comment.
Overall looks good, needs more polish on some edge cases.
| * Fully counting data drops throughout the pipeline | ||
| - The new `Countable` interface does open the possibility of counting partial failures at various points throughout the pipeline, which might be worth exploring down the line. The only thing I want to address right now is ensuring these are properly counted at the exporter metric level. | ||
| * Partial retries | ||
| - This is purely covering scenarios that align with the OTLP spec's definition of `PartialSuccess`, which does not come with any functionality to determine exactly which items failed |
There was a problem hiding this comment.
Given that consumererror can currently be used to trigger partial retries in the retry_sender, won't it be necessary to continue to support both:
- partial successes that can be partially retried and counted partially once retries fully fail
- partial successes that can only be counted
Otherwise, does it mean we will have behaviour where consumererror.NewTraces(err, failedTraces) indicates partial failure but it isn't counted partially?
|
|
||
| ### `PartialSuccess` error | ||
|
|
||
| A `PartialSuccess` error will wrap an internal error as permanent, implement [GRPC status code resolution](https://pkg.go.dev/google.golang.org/grpc/status#FromError) to code `OK`, and wrap it all as a `Countable` with a count of failures. |
There was a problem hiding this comment.
Why does it need to implement gRPC status code resolution? It looks like FromError looks for wrapped errors as well (using errors.As: https://github.com/grpc/grpc-go/blob/cb18228317ff523e63d931b4058b0329585b7dcd/status/status.go#L113). So if, for example, the OTLP exporter wants to provide the gRPC status code OK in its PartialSuccess error, it should just be able to use standard Go error wrapping to do so.
|
|
||
| This API may end up being useful in other places than `PartialSuccess`, such as [in `receiverhelper`](https://github.com/open-telemetry/opentelemetry-collector/issues/14440). | ||
|
|
||
| ### `PartialSuccess` error |
There was a problem hiding this comment.
Is this error meant to specifically be the OTLP "PartialSuccess" error? Or is this meant to be a collector type that means "The operation succeeded overall, but some parts failed"? If it is OTLP-specific, it shouldn't be in exporterhelper, and shouldn't be used by other components. If it is a generic "this thing partially succeeded" error, then it shouldn't do things like set gRPC status codes, etc.
Description
This RFC is to introduce proper accounting for
PartialSuccessaccounting in a way that aligns with the definition of partial success in the OTLP spec.Prior art: My first attempt to do this was in #14152, but I went back to the drawing board after realizing it was insufficient compared to the OTLP spec and there was some unrelated refactor to the
obs_report_sendercode.Link to tracking issue
#13423
#14440
AI Usage Disclosure
Minor Antigravity usage in the linked branches in the
Implementation Plansection. RFC itself was handwritten.